Tutorials & Notebooks


Workflow orchestration

Workflow orchestration in the ai-on-gke project involves managing and automating the execution of complex, multi-step processes, primarily for AI/ML workloads on Google Kubernetes Engine (GKE).

Fine-tuning Gemma 3-1B-it on L4

This tutorial guides you through fine-tuning the Gemma 3-1B-it language model on Google Kubernetes Engine (GKE) using an L4 GPU, leveraging Parameter-Efficient Fine-Tuning (PEFT) with LoRA. It covers setting up a GKE cluster, containerizing the fine-tuning code, running the fine-tuning job, and uploading the resulting model to Hugging Face. Finally, it demonstrates how to deploy and interact with the fine-tuned model using vLLM on GKE.
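For orientation, here is a minimal sketch of the LoRA setup at the heart of the tutorial, assuming the Hugging Face transformers and peft libraries; the rank, alpha, and target modules are illustrative hyperparameters, not the tutorial's exact values:

```python
# Minimal LoRA setup for parameter-efficient fine-tuning (illustrative
# hyperparameters; the tutorial's container runs a job along these lines).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "google/gemma-3-1b-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# LoRA trains small low-rank adapter matrices instead of all weights,
# which is what lets the job fit on a single L4 GPU.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all parameters
```

After training, `model.push_to_hub(...)` uploads the adapter to Hugging Face, matching the tutorial's final step.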

Running Flyte on GKE

This guide illustrates the deployment of Flyte on Google Kubernetes Engine (GKE) using Helm, utilizing Google Cloud Storage for scalable data storage and Cloud SQL PostgreSQL for a reliable metadata store. By the end of this tutorial, you will have a fully functional Flyte instance on GKE, offering businesses seamless integration with the GCP ecosystem, improved resource efficiency, and cost-effectiveness.
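For a sense of what runs on that instance, a minimal Flyte workflow looks like the sketch below (assuming the flytekit SDK; task bodies are placeholders). Once the Helm deployment is up, such a workflow can be registered against the cluster with `pyflyte register`.

```python
# Minimal flytekit workflow: two chained tasks with typed inputs/outputs.
from flytekit import task, workflow

@task
def preprocess(n: int) -> int:
    return n * 2

@task
def train(n: int) -> str:
    return f"trained on {n} samples"

@workflow
def pipeline(n: int = 100) -> str:
    return train(n=preprocess(n=n))

if __name__ == "__main__":
    # Workflows also run locally for quick iteration.
    print(pipeline(n=50))
```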

Populate a Hyperdisk ML Disk from Google Cloud Storage

This guide uses the Google Cloud API to create a Hyperdisk ML disk from data in Cloud Storage and then use the disk in a GKE cluster. Refer to this documentation for instructions that work entirely through the GKE API.
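As a hedged sketch of the disk-creation step via the Cloud API, assuming the google-cloud-compute client library (project, zone, name, and size are illustrative; populating the disk from Cloud Storage and attaching it in GKE follow the guide's remaining steps):

```python
# Create a Hyperdisk ML disk with the Compute Engine API (illustrative
# values); populating it from Cloud Storage and consuming it from GKE
# follow the remaining steps of the guide.
from google.cloud import compute_v1

client = compute_v1.DisksClient()
disk = compute_v1.Disk(
    name="model-weights-disk",  # hypothetical disk name
    size_gb=100,
    type_="zones/us-central1-a/diskTypes/hyperdisk-ml",
)
operation = client.insert(
    project="my-project",       # hypothetical project ID
    zone="us-central1-a",
    disk_resource=disk,
)
operation.result()  # block until creation completes
```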

Inference servers

Deploying and managing servers dedicated to performing inference tasks for machine learning models.

Deploying a Persistent Chatbot on Google Cloud Platform with LangChain, Streamlit, and IAP

In this tutorial, you will learn how to deploy a chatbot application using LangChain and Streamlit on Google Cloud Platform (GCP).
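A minimal shape of such an app, for orientation, assuming Streamlit's chat widgets and the langchain-google-vertexai integration (the model name is an assumption; IAP and GCP deployment are handled by the tutorial's infrastructure steps):

```python
# app.py — chat UI whose history persists across Streamlit reruns.
import streamlit as st
from langchain_google_vertexai import ChatVertexAI

llm = ChatVertexAI(model_name="gemini-1.5-flash")  # assumed model name

if "history" not in st.session_state:
    st.session_state.history = []

# Replay the stored conversation on every rerun.
for role, text in st.session_state.history:
    st.chat_message(role).write(text)

if prompt := st.chat_input("Ask something"):
    st.chat_message("user").write(prompt)
    reply = llm.invoke(prompt).content
    st.chat_message("assistant").write(reply)
    st.session_state.history += [("user", prompt), ("assistant", reply)]
```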

Llamaindex in GKE cluster

This tutorial will guide you through creating a robust Retrieval-Augmented Generation (RAG) system using LlamaIndex and deploying it on Google Kubernetes Engine (GKE).
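The core RAG loop in LlamaIndex is compact; a sketch, assuming the llama-index package with a configured LLM and embedding backend (the data directory and query are illustrative):

```python
# Ingest documents, build a vector index, and answer a grounded query.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("data/").load_data()  # hypothetical corpus
index = VectorStoreIndex.from_documents(documents)

query_engine = index.as_query_engine()
response = query_engine.query("What does this service do?")
print(response)
```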

Fine-Tuning Gemma 2-9B on GKE using Metaflow and Argo Workflows

This tutorial provides instructions for deploying and using the Metaflow framework on Google Kubernetes Engine (GKE) and operating AI/ML workloads with Argo Workflows.
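A skeleton of what such a flow looks like, assuming the metaflow package (step bodies and names are placeholders). With Metaflow configured against the cluster, `python fine_tune_flow.py argo-workflows create` compiles the flow into an Argo Workflows template.

```python
# fine_tune_flow.py — skeleton Metaflow flow with linear steps.
from metaflow import FlowSpec, step

class FineTuneFlow(FlowSpec):

    @step
    def start(self):
        self.dataset = "my-dataset"  # hypothetical dataset reference
        self.next(self.train)

    @step
    def train(self):
        # The tutorial's real training step runs the fine-tuning job here.
        print(f"fine-tuning on {self.dataset}")
        self.next(self.end)

    @step
    def end(self):
        print("done")

if __name__ == "__main__":
    FineTuneFlow()
```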

Fine-tune gemma-2-9b and track it as an experiment in MLflow

In this tutorial we will fine-tune gemma-2-9b with LoRA and track the run as an experiment in MLflow. We will deploy MLflow on a GKE cluster and configure it to store artifacts in a GCS bucket. Finally, we will deploy the fine-tuned model using KServe.
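The tracking side of the tutorial reduces to a few MLflow calls; a sketch in which the tracking URI, experiment name, and logged values are illustrative:

```python
# Log a LoRA fine-tuning run to the GKE-hosted MLflow server.
import mlflow

mlflow.set_tracking_uri("http://mlflow.example.internal:5000")  # hypothetical URI
mlflow.set_experiment("gemma-2-9b-lora")

with mlflow.start_run():
    mlflow.log_params({"base_model": "google/gemma-2-9b", "lora_r": 8})
    mlflow.log_metric("train_loss", 1.23, step=100)
    # With the server configured for GCS, artifacts land in the bucket.
    mlflow.log_artifact("adapter/adapter_model.safetensors")
```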

Models as OCI

This project allows you to download a Hugging Face model and package it as a Docker image. The Docker image can then be pushed to Google Artifact Registry for deployment or distribution. Build times can be significant for large models, so models above 10 billion parameters are not recommended. For reference, an 8B model takes roughly 35 minutes to build and push with this Cloud Build config.
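The download half of that pipeline is a single call with the huggingface_hub library; a sketch (the repo ID and local path are illustrative, and gated models additionally need an access token):

```python
# Fetch the model weights that the Docker build then copies into the image.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="google/gemma-2-9b",  # model to package
    local_dir="model/",           # directory COPY'd by the Dockerfile
)
```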

Using TPUs with KubeRay on GKE

This guide provides instructions for deploying and managing Ray custom resources on Google Kubernetes Engine (GKE) with TPUs. It details how to install the KubeRay TPU webhook, an admission webhook that bootstraps the environment variables required for TPU initialization and enables atomic scheduling of multi-host TPU workers on GKE node pools. This guide also provides a sample workload to verify proper TPU initialization and links to more advanced workloads to run with TPUs and Ray on GKE.
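A verification task in the spirit of the guide's sample workload: schedule a Ray task onto a TPU worker and read back the injected variables. The resource label and environment variable names below are assumptions based on the webhook's described behavior:

```python
# Verify TPU initialization: a Ray task placed on a TPU worker reports
# the environment variables the webhook injects.
import os
import ray

ray.init()

@ray.remote(resources={"TPU": 4})  # request one 4-chip TPU worker (assumed label)
def tpu_env():
    return {
        "TPU_WORKER_ID": os.environ.get("TPU_WORKER_ID"),
        "TPU_WORKER_HOSTNAMES": os.environ.get("TPU_WORKER_HOSTNAMES"),
    }

print(ray.get(tpu_env.remote()))
```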

Skypilot

In this tutorial, we will demonstrate how to leverage the open-source software SkyPilot to help GKE customers efficiently obtain accelerators across regions, ensuring workload continuity and optimized resource utilization.
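SkyPilot tasks can be declared from Python as well as YAML; a minimal launch sketch in which the accelerator spec, run command, and cluster name are illustrative:

```python
# Launch a task and let SkyPilot find capacity for the requested accelerator.
import sky

task = sky.Task(run="python train.py")
task.set_resources(sky.Resources(accelerators="L4:1"))

sky.launch(task, cluster_name="gke-accel")  # hypothetical cluster name
```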

vLLM GPU/TPU Fungibility

This tutorial shows you how to serve a large language model (LLM) on both Tensor Processing Units (TPUs) and GPUs on Google Kubernetes Engine (GKE) using the same vLLM deployment.
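The fungibility rests on vLLM exposing the same interface on either accelerator, so the application code runs unchanged (the model ID and prompt below are illustrative):

```python
# Offline inference with vLLM; the library targets whichever device
# (GPU or TPU) its installed backend supports, with no code changes.
from vllm import LLM, SamplingParams

llm = LLM(model="google/gemma-3-1b-it")
params = SamplingParams(temperature=0.7, max_tokens=64)

outputs = llm.generate(["What is GKE?"], params)
print(outputs[0].outputs[0].text)
```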
